Introducing the La Repubblica Corpus: A Large, Annotated, TEI(XML)-compliant Corpus of Newspaper Italian

Authors

Marco Baroni

Silvia Bernardini

Federica Comastri

Lorenzo Piccioni

Alessandra Volpi

Guy Aston

Marco Mazzoleni

Abstract

This paper describes the La Repubblica corpus, currently being developed at the SSLMIT of the University of Bologna. The corpus is a very large collection of newspaper text, currently amounting to 175 million words but expected to grow to 400 million before the end of 2004. When completed, it will contain all the articles published between 1985 and 2000 by the national daily La Repubblica. The paper discusses the techniques used to extract the text, tokenize it, and annotate it (basic TEI annotation, POS tagging, genre/topic categorization); presents examples of how the corpus can be used; and gives details of how interested users can access it. It concludes with a discussion of current and future developments and of the strengths and weaknesses of this resource.

Similar resources

EMOCause: An Easy-adaptable Approach to Extract Emotion Cause Contexts

In this paper we present a method to automatically identify linguistic contexts which contain possible causes of emotions or emotional states from Italian newspaper articles (La Repubblica Corpus). Our methodology is based on the interplay between relevant linguistic patterns and an incremental repository of common sense knowledge on emotional states and emotion eliciting situations. Our approa...

Sequenze N+pN (Nome Comune + Nome Proprio): Descrizione Linguistica da un Corpus dell'Italiano (N+pN (Common Noun + Proper Noun) Sequences: Linguistic Description from an Italian Corpus)

This paper describes the most significant N+pN (common noun + proper noun) structures in Italian, extracted from the La Repubblica 2002-2005 corpus.

Development of a Corpus Workbench for the METU Turkish Corpus

We will introduce a corpus workbench designed and implemented for the METU Turkish Corpus. The workbench design introduces a number of useful features and the workbench itself is basically usable with any TEI and XML compliant corpus, provided that it can be indexed in the format required by the workbench.

TEI P5 as an XML Standard for Treebank Encoding

The aim of the paper is to show that a subset of Text Encoding Initiative Guidelines is a reasonable choice as a standard for stand-off XML encoding of syntactically annotated corpora. The proposed TEI schema — actually employed in the National Corpus of Polish — is compared to other such candidate standards, including TIGER-XML, SynAF and PAULA.

A Corpus of Textual Revisions in Second Language Writing

This paper describes the creation of the first large-scale corpus containing drafts and final versions of essays written by non-native speakers, with the sentences aligned across different versions. Furthermore, the sentences in the drafts are annotated with comments from teachers. The corpus is intended to support research on textual revision by language learners, and how it is influenced by f...

Publication date: 2004
